10/4/2020

Are you using the best metric?

I’ll just use the default for the ML package and target

turn to slide 3

If I do something different from what we always do, I’ll get lots of hassle

turn to slide 3

It probably won’t make much difference; it’s a lot easier to use standard stuff and avoid questions

turn to slide 3

Maybe worth thinking about…?

turn to slide 4

[head shaking hmmmm]

Motivating Example

Example dataset

A dataset where the target is events that happen infrequently

  • 1,764 records
  • 2 features: feature_1 & feature_2
  • event column with 86 ones
  • event rate of 4.875%

Event rate across feature 1

Event rate across feature 2

What would be a good metric of loss?

  • A binary target means classification
    so maybe ROC AuC? GINI? Accuracy? GLM Generalised \(R^2\) …


  • The website neptune.ai explains “24 Evaluation Metrics for Binary Classification (And When to Use Them)”  Kappa, Brier Score, F2…

Metrics for binary classification

Confusion Matrix

False positive rate | Type-I error

False negative rate | Type-II error

True negative rate | Specificity

Negative predictive value

False discovery rate

True positive rate | Recall | Sensitivity

Positive predictive value | Precision

Accuracy

F beta score | F1 score | F2 score

Off the page

Cohen Kappa

Matthews correlation coefficient

ROC curve | ROC AUC score

Precision-Recall curve

PR AUC | Average precision

Log loss

Brier score

Cumulative gain chart

Lift curve | Lift chart

Kolmogorov-Smirnov plot | statistics

Class Imbalance

  • One of the big issues we often have is class imbalance

  • Our dataset’s event rate is 4.88%. That means over 95% of the outcome variable is a constant ‘zero’, with the rest being ‘one’

  • A loss function sums the loss over each record/row (equally or weighted). Predicting 0 gives zero loss on over 95% of records, so a model that always predicts 0 can still look good to the loss function (this can be mitigated by weighting)

  • Besides weighting there are other interesting ideas for dealing with imbalanced datasets; one article to check out is on svds.com

  • 2 of the most talked about metrics in binary classification are

    • ROC AuC
    • PR AuC
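As a quick illustration (not from the slides), here is a minimal scikit-learn sketch computing both metrics on a hypothetical dataset with roughly the same event rate as our example; the score is an assumed, mildly informative one:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)

# Hypothetical imbalanced data: ~5% event rate, like the example dataset
n = 2000
y = (rng.random(n) < 0.05).astype(int)

# A mildly informative score: events get a boost on average
score = rng.normal(0, 1, n) + 1.0 * y

roc = roc_auc_score(y, score)
pr = average_precision_score(y, score)  # PR AuC (average precision)
print(f"event rate={y.mean():.3f}  ROC AuC={roc:.3f}  PR AuC={pr:.3f}")
```

Note how, with heavy imbalance, the PR AuC comes out far lower than the ROC AuC for the same score: the frequent negative class flatters the ROC view.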

ROC - Area under Curve

It focuses on the trade-off between True Positives and False Positives

Quoting neptune.ai in the section on ROC - Area under Curve

When to use it:

“You should use it when you care equally about positive and negative classes. It naturally extends the imbalanced data discussion. If we care about true negatives as much as we care about true positives then it totally makes sense to use ROC AUC.”

Precision Recall - Area under Curve

From neptune.ai in the section on PR - Area under Curve

"When to use it:

  • when you want to communicate precision/recall decision to other stakeholders

  • when you want to choose the threshold that fits the business problem.

  • when your data is heavily imbalanced. Since PR AUC focuses mainly on the positive class (PPV and TPR) it cares less about the frequent negative class.

  • when you care more about positive than negative class. If you care more about the positive class and hence PPV and TPR you should go with Precision-Recall curve and PR AUC (average precision). "

Definitions

True Positive : prediction is positive (e.g. fraud) and the label is positive


False Positive : prediction is positive and the label is negative


True Negative : prediction is negative and the label is negative


False Negative : prediction is negative and the label is positive
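The four definitions above can be read straight off a confusion matrix. A small sketch with hypothetical labels, using scikit-learn:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, just to show the four counts
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

# For labels {0, 1}, sklearn lays the matrix out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
```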

Definitions for ROC AuC

True Positive Rate: # true positives over # all positives

False Positive Rate: # false positives over # all negatives


ROC AuC measures how well your model trades off having good TPR without having bad FPR


Note:
False Positive Rate = 1 - True Negative Rate
(True Negative Rate: # true negatives over # all negatives)


ROC AuC measures how well your model does on both TPR and TNR

Definitions for PR AuC

Precision : # true positives over # all predicted positive


Recall : # true positives over # all positives (aka True Positive Rate)



Precision Recall Curve AuC looks at how the model is doing with respect to both of these at the same time
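All four rates above are simple ratios of the confusion-matrix counts. A sketch with hypothetical counts:

```python
# Hypothetical counts from a confusion matrix
tp, fp, tn, fn = 40, 60, 1800, 100

tpr = tp / (tp + fn)        # True Positive Rate = Recall
fpr = fp / (fp + tn)        # False Positive Rate
tnr = tn / (tn + fp)        # True Negative Rate; note fpr == 1 - tnr
precision = tp / (tp + fp)  # Positive Predictive Value

print(f"TPR/Recall={tpr:.3f}  FPR={fpr:.3f}  Precision={precision:.3f}")
```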



Recap: what each metric likes

Example with type B as class of interest


Plotting the trade-off (ROC curve)

Let’s go back to our example dataset

  • 1,764 records; 2 features; a target with 86 (4.875%) “1”s and the rest “0”


  • The class imbalance is one thing to take into consideration


  • We also need to take into consideration our general problem at hand and what kind of trade-offs we want to make in performance


  • There is one other thing which we should consider: Is this a Classification problem or a Regression problem?

Our example dataset with ‘underlying truth’

What are we after?

  • The risk gradient is ideally what we are after

  • We would like to compare a model’s estimate of risk rate at the datapoints against the true risk (regression)

  • But we don’t know the true risk gradient in any form, we only have events

  • Events are a binary outcome so the tools of classification are really all we have

  • We are modelling Propensity



  • Events are our observable variable as a binary outcome

  • Risk is a latent (not directly observable) variable (e.g. a [0.0 to 1.0] propensity across the population)
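One way to picture the latent-variable view is to simulate it: draw a hidden propensity for each subject and then observe only a Bernoulli event. A sketch, with an assumed Beta-distributed risk chosen to give roughly our ~5% event rate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Latent risk (propensity) in [0, 1] - not observable in practice
n = 100_000
risk = rng.beta(2, 38, n)     # mean = 2/40 = 0.05, like our ~5% event rate

# What we actually observe: one binary event drawn from each propensity
event = rng.random(n) < risk

print("mean latent risk:", risk.mean())
print("observed event rate:", event.mean())
```

The observed event rate matches the mean latent risk, but any individual record tells us almost nothing about its own underlying propensity.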

False Positive?

  • In our example the event rate only rises to about 22%, so every subject is more likely not to have an event than to have one

  • A subject with a low risk rate can have an event, and another with a higher risk (propensity) may easily have none

  • There are no false positives or false negatives or true negatives…

  • With this in mind, let’s look again at our classification metrics

Understanding the usual metrics in our context

Instead of thinking of the boundary as separating positive and negative areas, we can consider the separation to be of high risk (or propensity) and low risk.

We have events, but we can assign no True or False to them; both high- and low-risk cases can have an event or no event

Old description      | Our description
predicted positive   | tagged high risk
predicted negative   | tagged low risk
actual positive      | event (insurance claim, fraud)
actual negative      | no-event

Re-labelling ROC AuC

ROC AUC: calibrating the balance of True Positive rate and True Negative Rate

Old description      | Our description
True Positive rate   | (# events in high risk region) / (# events overall)
True Negative rate   | (# no-events in low risk region) / (# no-events overall)

Re-labelling PR AuC

PR AUC: calibrating the balance of Recall (True Positive rate) and Precision

Old description | Our description
Recall          | (# events in high risk region) / (# events overall)
Precision       | (# events in high risk region) / (# records in high risk region)
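Under this relabelling, Recall and Precision are just event counts and event rates inside the tagged high-risk region. A sketch on hypothetical scores, where the top decile of the score is tagged high risk:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical events and a mildly informative score
n = 1764
event = (rng.random(n) < 0.05).astype(int)
score = rng.normal(0, 1, n) + event

high = score > np.quantile(score, 0.9)   # tag the top 10% as "high risk"

recall = event[high].sum() / event.sum() # events in high-risk region / events overall
precision = event[high].mean()           # event rate inside the high-risk region

print(f"recall={recall:.2f}  precision={precision:.2f}")
```

If the score carries signal, the event rate in the tagged region (precision) should sit well above the overall event rate.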

Logistic model derived boundary

What do we want?

  • ROC AuC implies we want to have more events in the high region than in the low region and also more non-events in the low region than in the high region

  • What we may want is a higher event rate in the high region (Precision) and a lower event rate in the low region (in classification the equivalent metric is the Negative Predictive Value).

  • Often business objectives are focused on the events (e.g. claims, cancellations, purchases) and in a fairly imbalanced setting are interested in the small proportion of the population which would have a higher propensity

  • PR AuC considers two things. It likes a boundary where the

    • Region of high risk has a (relatively) high event rate
    • As many of the events as possible lie in the region of high risk

Rate of events across prediction deciles

Are ROC AuC and PR AuC getting at the same thing?

  • They are pretty similar
  • ROC AuC is concerned with getting as many events in B as possible and as few in A as possible
  • PR AuC is concerned with getting as many events in B as possible and a high event rate in B
  • For balanced classification datasets they behave similarly, but in problems with high imbalance they do not
  • PR AuC is not so concerned about what is happening in the A region. Intuitively this can be a positive feature when you are looking for rare events and the ‘no events’ in A would otherwise overwhelm your metric
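A small hand-built ranking example (hypothetical scores, not our dataset) shows the two metrics can genuinely disagree: score_1 pins one event at the very top of the ranking, score_2 ranks both events near, but not at, the top:

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# Two events among ten records
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]

# score_1: one event ranked 1st, the other ranked 6th
score_1 = [10, 5, 9, 8, 7, 6, 4, 3, 2, 1]
# score_2: events ranked 2nd and 3rd, behind one non-event
score_2 = [9, 8, 10, 7, 6, 5, 4, 3, 2, 1]

roc_1, roc_2 = roc_auc_score(y, score_1), roc_auc_score(y, score_2)
pr_1, pr_2 = average_precision_score(y, score_1), average_precision_score(y, score_2)
print(f"ROC AuC: {roc_1:.3f} vs {roc_2:.3f}")
print(f"PR AuC:  {pr_1:.3f} vs {pr_2:.3f}")
```

Here score_2 wins on ROC AuC (0.875 vs 0.75) while score_1 wins on PR AuC (0.667 vs 0.583): the metrics reward different ranking behaviour.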

An aside on the Gini metric

  • Gini is the ROC AuC measure on a different scale (Gini = 2 × AuC − 1)

  • Used a lot in financial settings

  • It has a number of different definitions; a useful one in financial settings is: the Gini coefficient is proportional to the covariance between a variable and its rank

  • This definition shows why it is used to evaluate scores: a good score (rank) should be correlated (covariance) with the outcome

  • If we wish to use a score for cutoffs between regions of different risk/propensity, particularly in highly unbalanced datasets, it may not be the best metric
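The rank-covariance view can be checked numerically. On hypothetical data with no tied scores, Gini = 2 × AuC − 1 can be recovered as 2·n·cov(y, rank) / (n1·n0), where n1 and n0 are the event and non-event counts (a sketch; the constant of proportionality here is derived under the no-ties assumption):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 5000
y = (rng.random(n) < 0.05).astype(int)
score = rng.normal(0, 1, n) + y            # mildly informative score

auc = roc_auc_score(y, score)
gini = 2 * auc - 1                         # Gini is ROC AuC rescaled to [-1, 1]

# Rank-covariance view: Gini is proportional to cov(outcome, rank of score)
ranks = np.argsort(np.argsort(score)) + 1  # ranks 1..n (continuous scores: no ties)
cov = np.cov(y, ranks, bias=True)[0, 1]    # population covariance
n1, n0 = y.sum(), n - y.sum()
gini_from_cov = 2 * n * cov / (n1 * n0)

print(f"Gini={gini:.4f}  from rank covariance={gini_from_cov:.4f}")
```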

A tale of two models

  • A (toy) dataset of 2,020 data points

  • In these we find 20 events, so about 1% rate

  • We have developed 2 candidate models

  • And we will follow the well worn path of ROC AuC …

A tale of two models: ROC AuC

A tale of two models

  • Model 1 finds an area of feature space where there is a very high density of cases, almost a 100% case rate. In that very high propensity area about half of our events are found

  • Beyond that area the events happen at a baseline rate of about 1 event per 200

  • Model 2 doesn’t identify the high propensity region but does find a gradient in the feature space.

  • Model 2 is a bit better ‘on average’ so it is the winner by ROC AuC

  • Should we really be integrating the performance over all possible cutoff levels? Partial AuC is sometimes used, and may be worth looking at
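If partial AuC sounds interesting: scikit-learn exposes a standardised partial AuC through the max_fpr argument of roc_auc_score, which integrates the ROC curve only up to a chosen False Positive Rate. A sketch on hypothetical data of the same shape as the toy dataset:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

# Hypothetical data: ~1% event rate among 2,020 records, like the toy dataset
n = 2020
y = (rng.random(n) < 0.01).astype(int)
score = rng.normal(0, 1, n) + 2 * y

full = roc_auc_score(y, score)
partial = roc_auc_score(y, score, max_fpr=0.1)  # integrate only up to FPR = 0.1
print(f"full AuC={full:.3f}  partial AuC (FPR <= 0.1)={partial:.3f}")
```

Restricting to low FPR focuses the metric on the top of the ranking, which is where Model 1’s concentrated high-propensity region pays off.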

A tale of two models: PR AuC

On their own adventure…


What about the F-Score? other metrics?

  • If we think PR analysis is better why not consider an F-score?

  • An F-1 Score is the harmonic mean of the Precision and Recall

  • For more emphasis on Recall or Precision maybe the F-2 score or F beta score

  • Remember: Recall is about creating a boundary where events mostly happen in the high risk/propensity region. Precision is about getting as high a rate of events as possible in our high risk/propensity region.

  • Partial ROC AuC ; Weighted ROC AuC …
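A sketch of the F-scores on hypothetical predictions (here precision = 2/3 and recall = 1/2), showing how the beta parameter shifts the emphasis between the two:

```python
from sklearn.metrics import fbeta_score, f1_score

# Hypothetical predictions at a chosen threshold
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # precision = 2/3, recall = 1/2

f1 = f1_score(y_true, y_pred)             # harmonic mean of precision and recall
f2 = fbeta_score(y_true, y_pred, beta=2)    # weights recall more heavily
f05 = fbeta_score(y_true, y_pred, beta=0.5) # weights precision more heavily
print(f"F1={f1:.3f}  F2={f2:.3f}  F0.5={f05:.3f}")
```

Because precision exceeds recall in this example, F0.5 > F1 > F2.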

To wrap up

  • Using classification tools for propensity modelling may need a bit more thought. Data science is a developing field; not all the methodological questions have been answered

  • Unbalanced datasets have more challenges than balanced datasets

  • Although we mostly discussed PR versus ROC, there may be something better for your particular task

  • You can always present results to others using a well understood metric - even if you use a less well known but more suitable metric in model/score building

Random Links

Choose your own Loss functions ?

Start your own adventure…